An Introduction to Data Privacy in Practice
Torus Talk - MacEwan University, March 2026
Attribution
This material is adapted from the following sources:
What Are You Comfortable Sharing?
Consider different types of data:
- Your favorite type of music
- Your Instagram likes and follows
- Your e-mail
- Your name and DOB
- Your GPS location throughout the day
- Your browsing history
- Your private messages/DMs
- Which of these data would you feel comfortable sharing with an app?
- What questions would you want to ask before sharing this data?
- What if it combined two or three pieces of information?
Learning Objectives
By the end of today’s lesson, you should be able to:
Understand key terms in data privacy, including PII, pseudonymization, and anonymization
Identify direct and indirect identifiers in sample data sets
Explain why de-identification is challenging and context-dependent
Apply basic de-identification techniques (e.g., suppression, top-coding, permutation) using R
Recognize the tradeoff between data utility and privacy risk
What Happens to Your Data?
Every time you use an app, visit a website, click on a link, fill out a survey or even just scroll on your device, your data is being:
- Collected - What you click, search, watch, like or buy
- Analyzed - Used to predict your behaviour, interests or identity
- Shared or Sold - Passed to advertisers, data brokers or other companies
Why Does This Matter?
- You may be targeted with ads, content and potentially misinformation
- You could be judged or profiled based on your data (even if it’s not accurate)
- You rarely know who has your data (or what they’re doing with it)
- So what does this mean for us? Let’s explore how data can be used, what makes certain information sensitive and why it matters.
Personal Data
Data can be identifiable when:
- They contain directly identifying information.
- It’s possible to single out an individual
- It’s possible to infer information about an individual based on information in your dataset
- It’s possible to link records relating to an individual.
- De-identification is still reversible.
Scenario: Can This Data Identify You?
A fitness app shares anonymized data with researchers. The dataset includes:
- Step count per day
- General location (postal code)
- Age
- Time of day the user exercises
- Health conditions
Separately, a publicly available dataset includes information from a local running club: names, age groups and 5K race times.
The Mosaic Effect
The “Mosaic Effect” happens when separate pieces of data, which alone don’t identify anyone, are combined from different sources to reveal personal information or identify an individual.
A study published in 2000 estimated that 87% of the United States population could likely be uniquely identified by the combination of their 5-digit ZIP code, gender and date of birth.
https://dataprivacylab.org/projects/identifiability/paper1.pdf
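A minimal sketch of how such linking can work, in base R. Every value and column name below is made up for illustration: `released` stands in for a dataset with names removed, and `public_list` for a separate public source that happens to share the same three quasi-identifiers.

```r
# Hypothetical "de-identified" release: no names, but quasi-identifiers remain.
released <- data.frame(
  zip       = c("11111", "22222", "11112"),
  gender    = c("F", "M", "F"),
  dob       = c("1990-04-12", "1985-11-03", "1990-04-12"),
  condition = c("asthma", "diabetes", "none"),
  stringsAsFactors = FALSE
)

# Hypothetical public list that includes names alongside the same fields.
public_list <- data.frame(
  name   = c("Person A", "Person B"),
  zip    = c("11111", "22222"),
  gender = c("F", "M"),
  dob    = c("1990-04-12", "1985-11-03"),
  stringsAsFactors = FALSE
)

# Joining on the shared quasi-identifiers attaches names to conditions.
linked <- merge(released, public_list, by = c("zip", "gender", "dob"))
linked[, c("name", "condition")]
```

Neither table identifies anyone's condition on its own; the join does, for every record whose combination of ZIP code, gender and date of birth is unique.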
Pseudonymization and Anonymization
- Pseudonymization and anonymization are techniques to de-identify personal data
- Goal: reduce linkability of data to individuals
- We will now define each of these terms
Pseudonymization
- Reduces linkability of data to individuals
- Data cannot identify individuals without additional information
- Often done by replacing direct identifiers with pseudonyms
- Link between real identifiers and pseudonyms is stored separately
- Re-identification remains possible!
Anonymization
- Data are anonymized when no individual is identifiable (directly or indirectly)
- This applies even to the data controller
- Fully anonymized data are no longer personal data
- Anonymization is difficult to achieve in practice
When Are Data Truly Anonymous?
- Only if re-identification would require unreasonable effort (factors include cost, time and available technology)
- Data are not anonymous if:
- Direct identifiers are present
- Individuals can be singled out from a group
- Re-identification possible via linking datasets (mosaic effect)
- Inference about identity is possible (e.g., through different variables)
- De-identification can be reversed
Context Matters
- Whether data are anonymous depends on:
- The context of the research
- Available external information
- Future data uses
De-identification Techniques
Techniques to deidentify your data include:
- Suppression
- Generalization
- Replacement
- Top- and bottom-coding
- Adding noise
- Permutation
We will talk about each of these techniques individually.
First, let’s generate some data we can use to help illustrate these concepts.
# A tibble: 4 × 3
name age height_cm
<chr> <dbl> <dbl>
1 Joel Miller 52 182
2 Ellie Williams 19 160
3 Tommy Miller 48 185
4 Abby Anderson 28 173
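The code that produced this table is not shown. One way to build the same toy data is sketched below with base R's `data.frame()`; the slides print a tibble, so the original presumably used the tibble package instead. The object name `people` is an assumption used for illustration.

```r
# Toy data matching the table above (object name `people` is illustrative).
people <- data.frame(
  name      = c("Joel Miller", "Ellie Williams", "Tommy Miller", "Abby Anderson"),
  age       = c(52, 19, 48, 28),
  height_cm = c(182, 160, 185, 173),
  stringsAsFactors = FALSE
)

people
```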
Suppression
- Remove entire variables, values or records
- Used to eliminate highly identifying or unnecessary data
- Examples:
- Names, contact details, social security numbers
- GPS metadata, IP addresses, neuroimaging facial features
- Outliers or unique participants
Suppression Example
# A tibble: 4 × 2
age height_cm
<dbl> <dbl>
1 52 182
2 19 160
3 48 185
4 28 173
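A minimal base R sketch of how a result like this could be produced, assuming the toy data is stored in a data frame called `people` (a name chosen here for illustration). Suppression in this case means dropping the directly identifying `name` column.

```r
people <- data.frame(
  name      = c("Joel Miller", "Ellie Williams", "Tommy Miller", "Abby Anderson"),
  age       = c(52, 19, 48, 28),
  height_cm = c(182, 160, 185, 173),
  stringsAsFactors = FALSE
)

# Keep every column except the direct identifier.
suppressed <- people[, setdiff(names(people), "name"), drop = FALSE]

suppressed
```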
Generalization
- Reduces detail or granularity in the data
- Makes individuals harder to single out
- Examples:
- Convert date of birth to age, or group into ranges
- Replace address with town or region
- Recategorise rare labels into “other” or “missing”
- Abstract people or places in qualitative data (e.g., “Bob” to “[colleague]”)
Generalization Example
Here we will show an example of generalization on the age column:
# A tibble: 4 × 3
name height_cm age_group
<chr> <dbl> <chr>
1 Joel Miller 182 30+
2 Ellie Williams 160 under 30
3 Tommy Miller 185 30+
4 Abby Anderson 173 under 30
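The grouping above can be sketched in base R as follows. The cut-off of 30 is read off the table; the data frame name `people` is an assumption for illustration.

```r
people <- data.frame(
  name      = c("Joel Miller", "Ellie Williams", "Tommy Miller", "Abby Anderson"),
  age       = c(52, 19, 48, 28),
  height_cm = c(182, 160, 185, 173),
  stringsAsFactors = FALSE
)

generalized <- people

# Replace exact age with a coarse two-level group.
generalized$age_group <- ifelse(generalized$age >= 30, "30+", "under 30")

# Drop the exact age so only the generalized version remains.
generalized$age <- NULL

generalized
```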
Replacement
- Swap identifying info with less informative alternatives
- Examples:
- Use pseudonyms for names (with securely stored keyfile)
- Replace with placeholders (e.g., “[redacted]”)
- Rounding numeric values
Creating Pseudonyms
- Pseudonyms should reveal nothing about the subject
- Good pseudonyms:
- Are random or meaningless strings/numbers
- Are securely managed (e.g., encrypted keyfile)
- Can be generated using tools in Excel, R, Python, SPSS
Replacement with Pseudonyms
# A tibble: 4 × 3
pseudonym age height_cm
<chr> <dbl> <dbl>
1 ID1 52 182
2 ID2 19 160
3 ID3 48 185
4 ID4 28 173
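A minimal base R sketch of pseudonym replacement with a separate keyfile, assuming the toy data lives in a data frame called `people`. The sequential `ID1`...`ID4` labels mirror the table above; in practice, random labels would reveal less (sequential ones reveal row order), and the key would be stored securely and apart from the shared data.

```r
people <- data.frame(
  name      = c("Joel Miller", "Ellie Williams", "Tommy Miller", "Abby Anderson"),
  age       = c(52, 19, 48, 28),
  height_cm = c(182, 160, 185, 173),
  stringsAsFactors = FALSE
)

# Keyfile: the only place where names and pseudonyms appear together.
key <- data.frame(
  name      = people$name,
  pseudonym = paste0("ID", seq_len(nrow(people))),
  stringsAsFactors = FALSE
)

# Shareable data: pseudonym in place of the name.
pseudonymized <- data.frame(
  pseudonym = key$pseudonym,
  age       = people$age,
  height_cm = people$height_cm,
  stringsAsFactors = FALSE
)

# Anyone holding the key can reverse the replacement.
reidentified <- merge(pseudonymized, key, by = "pseudonym")
```

The last line illustrates why pseudonymized data still counts as re-identifiable: the link is only hidden, not destroyed.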
Hashing
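- Hashing converts names into fixed-length strings using a one-way function.
- Unlike pseudonyms, hashed values have no keyfile that maps them back, so they cannot be easily reversed. However, hashes of guessable values such as names can still be matched by hashing candidate names and comparing.
- In R, we can use the digest package (and its digest function) to hash.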
# A tibble: 4 × 3
# Rowwise:
name_hash age height_cm
<chr> <dbl> <dbl>
1 4a3e0ee26ab3fb1338e893f4d4e7244b 52 182
2 201943dd66d423ed3cce2242a75736d4 19 160
3 81699ec9483bad176eed57ee43ffa010 48 185
4 046dff9ba9cf33573396f4de8c0c0e0b 28 173
- What happens if we just apply digest to the name vector without using rowwise?
digest hashes the entire name column as a single object (it’s not vectorized), so mutate recycles that one hash to every row (which is not what we want).
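The difference can be sketched in base R, assuming the digest package is installed and the toy data is in a data frame called `people`. Here `vapply()` plays the role that `rowwise()` plays in a dplyr pipeline: it calls `digest()` once per name. The `algo = "md5"` argument is digest's default, written out for clarity.

```r
library(digest)  # provides digest(); installed separately from base R

people <- data.frame(
  name      = c("Joel Miller", "Ellie Williams", "Tommy Miller", "Abby Anderson"),
  age       = c(52, 19, 48, 28),
  height_cm = c(182, 160, 185, 173),
  stringsAsFactors = FALSE
)

# One hash per name: digest() is applied to each element separately.
name_hash <- vapply(people$name, digest, character(1),
                    algo = "md5", USE.NAMES = FALSE)

hashed <- data.frame(
  name_hash = name_hash,
  age       = people$age,
  height_cm = people$height_cm,
  stringsAsFactors = FALSE
)

# For contrast: hashing the whole column returns a single string,
# which would be recycled to every row if assigned as a column.
whole_column_hash <- digest(people$name, algo = "md5")
```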
Top- and Bottom-Coding
- Limits extreme values in quantitative data
- Recode all values above or below a threshold
- Example: all incomes above $150,000 become $150,000
- Preserves much of the dataset, but distorts distribution tails
Top-coding example
- Suppose 6 ft (182.88 cm) is our maximum height threshold.
# A tibble: 4 × 3
name age height_cm
<chr> <dbl> <dbl>
1 Joel Miller 52 182
2 Ellie Williams 19 160
3 Tommy Miller 48 183.
4 Abby Anderson 28 173
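A minimal base R sketch of this top-coding step, assuming the toy data is in a data frame called `people`. `pmin()` takes the element-wise minimum, so any height above the threshold is replaced by the threshold itself.

```r
people <- data.frame(
  name      = c("Joel Miller", "Ellie Williams", "Tommy Miller", "Abby Anderson"),
  age       = c(52, 19, 48, 28),
  height_cm = c(182, 160, 185, 173),
  stringsAsFactors = FALSE
)

max_height_cm <- 182.88  # 6 ft expressed in centimetres

top_coded <- people
# Values above the threshold are capped; all others are left unchanged.
top_coded$height_cm <- pmin(top_coded$height_cm, max_height_cm)

top_coded
```

Only Tommy's height changes (185 becomes 182.88, which a tibble prints as `183.`). Bottom-coding would use `pmax()` with a minimum threshold in the same way.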
Adding Noise
- Introduces randomness to protect sensitive info
- Examples:
- Add a small random amount to numeric values
- Blur images or alter voices
- Use differential privacy algorithms (advanced)
Adding Noise to Height
This adds random noise to the height variable from a normal distribution (\(\mu=0\), \(\sigma=2\)), reducing exact re-identification risk.
# A tibble: 4 × 3
name age height_cm_noisy
<chr> <dbl> <dbl>
1 Joel Miller 52 182.
2 Ellie Williams 19 160.
3 Tommy Miller 48 186.
4 Abby Anderson 28 174.
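A sketch of the noise step in base R, assuming the toy data is in a data frame called `people`. The seed value is arbitrary and chosen only to make the sketch reproducible, so the numbers it produces will not match the table above.

```r
people <- data.frame(
  name      = c("Joel Miller", "Ellie Williams", "Tommy Miller", "Abby Anderson"),
  age       = c(52, 19, 48, 28),
  height_cm = c(182, 160, 185, 173),
  stringsAsFactors = FALSE
)

set.seed(2026)  # arbitrary seed, for reproducibility only

noisy <- people
# Add independent Normal(mean = 0, sd = 2) noise to each height.
noisy$height_cm_noisy <- noisy$height_cm + rnorm(nrow(noisy), mean = 0, sd = 2)

# Remove the exact heights so only the noisy version is shared.
noisy$height_cm <- NULL

noisy
```

A larger `sd` gives more protection but moves the shared values further from the truth, which is the utility and privacy tradeoff in miniature.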
Permutation
- Swap values between individuals
- Makes linking variables across a record more difficult
- Maintains distributions, but breaks correlations
- Can limit the types of analyses possible
Permutation of Height Values
Here, the height_cm values are shuffled between individuals, preserving the overall distribution but breaking the link between person and value.
# A tibble: 4 × 3
name age height_cm_permuted
<chr> <dbl> <dbl>
1 Joel Miller 52 160
2 Ellie Williams 19 173
3 Tommy Miller 48 182
4 Abby Anderson 28 185
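A sketch of the shuffle in base R, assuming the toy data is in a data frame called `people`. The seed is arbitrary, so the resulting order will generally differ from the table above.

```r
people <- data.frame(
  name      = c("Joel Miller", "Ellie Williams", "Tommy Miller", "Abby Anderson"),
  age       = c(52, 19, 48, 28),
  height_cm = c(182, 160, 185, 173),
  stringsAsFactors = FALSE
)

set.seed(2026)  # arbitrary seed, for reproducibility only

permuted <- people
# sample() without a size argument returns a random reordering of the vector.
permuted$height_cm_permuted <- sample(permuted$height_cm)
permuted$height_cm <- NULL

permuted
```

The set of heights is unchanged, so summaries such as the mean and range survive, while any relationship between height and age is scrambled.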
Key Takeaways
- Data exists on a spectrum of identifiability
- Even seemingly anonymous data can often be re-identified (e.g., mosaic effect)
- Different techniques offer varying levels of protection and utility
- Context, external data and technological capabilities all affect re-identification risk
- Responsible data handling requires both technical skill and ethical awareness
Case Study: Brogan Inc. and NIHB Data
- The Non-Insured Health Benefits (NIHB) database contains sensitive health data on First Nations use of services like prescriptions, dental care, and medical devices.
- In 2001, Health Canada began releasing de-identified NIHB pharmacy claims data to Brogan Inc., a private health consulting firm.
- Though personal identifiers were removed, community identifiers remained, and First Nations were not informed until 2007.
- Brogan sold the data to pharmaceutical companies for commercial research and marketing
- Health Canada justified the release by claiming no privacy interests remained since personally identifying information had been removed.
Kukutai, T., & Taylor, J. (2016). Indigenous data sovereignty: Toward an agenda. ANU Press.
Discussion
Take 5 minutes to discuss this case in groups of 2-3. Reflect on these questions:
- Was the data truly de-identified?
- Should de-identified data still require community consent before being shared or sold?
- What are the limits of simply removing names and IDs from a dataset?
- How can we measure whether a dataset is truly “safe” to release?
Learning Objectives
By the end of today’s lesson, you should be able to:
- Define identifiers, quasi-identifiers and sensitive attributes in data sets
- Explain the limitations of basic deidentification methods
- Describe the concepts of \(k\)-anonymity, \(l\)-diversity and \(t\)-closeness
- Apply \(k\)-anonymity and \(l\)-diversity to de-identify data
- Understand the basic idea of differential privacy and its significance
Why basic deidentification isn’t always enough
Last class, we introduced some techniques for deidentification such as suppression and generalization.
However, individuals can often be re-identified using other information.
As datasets become more detailed and linkable, privacy risks increase.
Statistical methods are needed to ensure meaningful deidentification while preserving data utility.
Statistical approaches to deidentification
- \(k\)-anonymity
- \(l\)-diversity
- \(t\)-closeness
- Differential privacy (advanced)
Overview of privacy models
- \(k\)-anonymity, \(l\)-diversity, and \(t\)-closeness are statistical approaches that quantify the level of identifiability within a tabular dataset.
- They focus on how variables combined can lead to identification.
- These approaches are complementary: a dataset can be simultaneously \(k\)-anonymous, \(l\)-diverse, and \(t\)-close, where \(k\), \(l\), and \(t\) represent numeric thresholds.
- \(k\)-anonymity, \(l\)-diversity, and \(t\)-closeness are typically used to de-identify tabular datasets before sharing.
- They work best on relatively large datasets, where enough observations are present to preserve useful detail while still protecting privacy.
Identifiers, Quasi-Identifiers, and Sensitive Attributes
Privacy models distinguish between three types of variables:
Identifiers: Direct identifiers such as names, student numbers, email addresses.
Quasi-Identifiers: Indirect identifiers that can lead to identification when combined with other quasi-identifiers or external data.
- Examples: age, sex, place of residence, physical characteristics, timestamps, etc.
Sensitive Attributes: Variables of interest that need protection and cannot be altered as they are key outcomes.
- Examples: Medical condition, Income, etc.
Importance of Correct Variable Categorization
- Correctly categorizing variables into identifiers, quasi-identifiers, and sensitive attributes is crucial.
- This categorization determines how to de-identify your dataset effectively using \(k\)-anonymity, \(l\)-diversity, and \(t\)-closeness.
- Now, let’s discuss each of these techniques in detail…
\(k\)-anonymity
- A data set is \(k\)-anonymous if each observation cannot be distinguished from at least \(k-1\) other observations based on the quasi-identifiers.
- This can be achieved through generalization, suppression and sometimes top- or bottom-coding of data values.
- Applying \(k\)-anonymity makes it more difficult for an attacker to single out or re-identify specific individuals.
- It also helps reduce the risk of the mosaic effect, where combining data points could lead to identification.
Making a data set \(k\)-anonymous
- Identify variables as identifiers, quasi-identifiers and sensitive attributes.
- Choose a value for \(k\).
- Aggregate or transform the data so each combination of quasi-identifiers occurs at least \(k\) times.
Choosing \(k\)
- There is no single correct value for \(k\)!
- Higher \(k\) increases privacy, but reduces data detail and utility.
- The choice depends on promises made to data subjects and acceptable risk levels.
Example data
- Age and city are quasi-identifiers, and salary is considered a sensitive attribute.
| Age | City | Salary |
|---|---|---|
| 38 | Calgary | 91,000 |
| 37 | Toronto | 92,000 |
| 31 | Vancouver | 82,000 |
| 48 | Calgary | 115,000 |
| 39 | Vancouver | 118,000 |
| 37 | Calgary | 97,000 |
| 34 | Toronto | 98,000 |
| 33 | Vancouver | 89,000 |
| 32 | Toronto | 108,000 |
| 45 | Calgary | 95,000 |
\(k=2\)
| Age Range | City | Salary Range |
|---|---|---|
| 30–39 | Calgary | 90,000–99,999 |
| 30–39 | Toronto | 90,000–99,999 |
| 30–39 | Vancouver | 80,000–89,999 |
| 40–49 | Calgary | 110,000–119,999 |
| 30–39 | Vancouver | 110,000–119,999 |
| 30–39 | Calgary | 90,000–99,999 |
| 30–39 | Toronto | 90,000–99,999 |
| 30–39 | Vancouver | 80,000–89,999 |
| 30–39 | Toronto | 100,000–109,999 |
| 40–49 | Calgary | 90,000–99,999 |
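Whether a table is \(k\)-anonymous can be checked by counting how often each combination of quasi-identifiers occurs. The base R sketch below does this for the example data, before and after generalizing age into ten-year ranges. The helper name `min_group_size` and the data frame name `salaries` are assumptions for illustration.

```r
salaries <- data.frame(
  age    = c(38, 37, 31, 48, 39, 37, 34, 33, 32, 45),
  city   = c("Calgary", "Toronto", "Vancouver", "Calgary", "Vancouver",
             "Calgary", "Toronto", "Vancouver", "Toronto", "Calgary"),
  salary = c(91000, 92000, 82000, 115000, 118000,
             97000, 98000, 89000, 108000, 95000),
  stringsAsFactors = FALSE
)

# Smallest number of rows sharing one combination of quasi-identifiers.
# A dataset is k-anonymous for every k up to this value.
min_group_size <- function(data, quasi_identifiers) {
  counts <- table(data[quasi_identifiers])
  min(counts[counts > 0])  # ignore combinations that never occur
}

k_raw <- min_group_size(salaries, c("age", "city"))

# Generalize exact age into ten-year ranges such as "30-39".
decade <- floor(salaries$age / 10) * 10
salaries$age_range <- paste0(decade, "-", decade + 9)

k_generalized <- min_group_size(salaries, c("age_range", "city"))
```

With exact ages every row is unique (`k_raw` is 1); after generalization the smallest group has two rows (`k_generalized` is 2).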
\(l\)-diversity
- \(l\)-diversity is an extension of \(k\)-anonymity that ensures sufficient variation in a sensitive attribute.
- This is important because if all individuals within a group share the same sensitive value, there is still a risk of inference.
- Although these data are \(2\)-anonymous, we can still infer that any 30–39-year-old from Calgary who participated earns between $90,000 and $99,999.
| Age Range | City | Salary Range |
|---|---|---|
| 30–39 | Calgary | 90,000–99,999 |
| 30–39 | Toronto | 90,000–99,999 |
| 30–39 | Vancouver | 80,000–89,999 |
| 40–49 | Calgary | 110,000–119,999 |
| 30–39 | Vancouver | 110,000–119,999 |
| 30–39 | Calgary | 90,000–99,999 |
| 30–39 | Toronto | 90,000–99,999 |
| 30–39 | Vancouver | 80,000–89,999 |
| 30–39 | Toronto | 100,000–109,999 |
| 40–49 | Calgary | 90,000–99,999 |
\(l\)-diversity
- The approach requires at least \(l\) different values for the sensitive attribute within each combination of quasi-identifiers.
- Again, there is no perfect value for \(l\) (typically \(1< l \leq k\)).
- With \(l=2\), that means that for each combination of Age Range and City, there are at least 2 distinct Salary Ranges.
| Age Range | City | Salary Range |
|---|---|---|
| 30–39 | - | 90,000–99,999 |
| 30–39 | - | 90,000–99,999 |
| 30–39 | - | 80,000–89,999 |
| 40–49 | Calgary | 110,000–119,999 |
| 30–39 | - | 110,000–119,999 |
| 30–39 | - | 90,000–99,999 |
| 30–39 | - | 90,000–99,999 |
| 30–39 | - | 80,000–89,999 |
| 30–39 | - | 100,000–109,999 |
| 40–49 | Calgary | 90,000–99,999 |
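To check \(l\)-diversity, count the distinct sensitive values within each combination of quasi-identifiers. The base R sketch below does this for the \(2\)-anonymous table, then again after suppressing City for the 30–39 group. The helper name `min_distinct_values`, the data frame names and the shortened salary labels are assumptions for illustration.

```r
released <- data.frame(
  age_range    = c("30-39", "30-39", "30-39", "40-49", "30-39",
                   "30-39", "30-39", "30-39", "30-39", "40-49"),
  city         = c("Calgary", "Toronto", "Vancouver", "Calgary", "Vancouver",
                   "Calgary", "Toronto", "Vancouver", "Toronto", "Calgary"),
  salary_range = c("90-99k", "90-99k", "80-89k", "110-119k", "110-119k",
                   "90-99k", "90-99k", "80-89k", "100-109k", "90-99k"),
  stringsAsFactors = FALSE
)

# Smallest number of distinct sensitive values found in any group.
# A dataset is l-diverse for every l up to this value.
min_distinct_values <- function(data, quasi_identifiers, sensitive) {
  per_group <- aggregate(
    data[[sensitive]],
    by  = data[quasi_identifiers],
    FUN = function(x) length(unique(x))
  )
  min(per_group$x)
}

l_before <- min_distinct_values(released, c("age_range", "city"), "salary_range")

# Suppress City for the 30-39 age range, as in the table above.
diverse <- released
diverse$city[diverse$age_range == "30-39"] <- "-"

l_after <- min_distinct_values(diverse, c("age_range", "city"), "salary_range")
```

Before suppression the 30–39 Calgary group has a single salary range (`l_before` is 1); afterwards every group has at least two (`l_after` is 2).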
\(t\)-closeness
- \(t\)-closeness builds on \(k\)-anonymity and \(l\)-diversity by requiring that the distribution of the sensitive attribute within each group of quasi-identifiers is close to the distribution in the full dataset.
- This prevents situations where a sensitive value is overly dominant in a group, which could allow an attacker to infer that value for everyone in the group.
- For example, in a dataset with Age and Sex as quasi-identifiers and Income as the sensitive attribute, applying \(t\)-closeness with \(t = 0.1\) means the distance between each group’s income distribution and the overall income distribution (commonly measured with the Earth Mover’s Distance) must be at most 0.1.
- In this course, we will focus mainly on \(k\)-anonymity and \(l\)-diversity as \(t\)-closeness can get complicated to implement.
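For the curious, here is a rough base R sketch of a \(t\)-closeness check on the \(l\)-diverse salary table. The slides do not name a distance measure, so this assumes the ordered-category form of the Earth Mover's Distance: the sum of absolute cumulative differences between the two distributions, divided by the number of categories minus one. The function name `ordered_emd`, the data frame name and the shortened salary labels are illustrative.

```r
diverse <- data.frame(
  age_range    = c("30-39", "30-39", "30-39", "40-49", "30-39",
                   "30-39", "30-39", "30-39", "30-39", "40-49"),
  city         = c("-", "-", "-", "Calgary", "-", "-", "-", "-", "-", "Calgary"),
  salary_range = c("90-99k", "90-99k", "80-89k", "110-119k", "110-119k",
                   "90-99k", "90-99k", "80-89k", "100-109k", "90-99k"),
  stringsAsFactors = FALSE
)

# Distance between two distributions over the same ordered categories.
ordered_emd <- function(p, q) {
  sum(abs(cumsum(p - q))) / (length(p) - 1)
}

# Salary ranges as an ordered factor so every group reports all four levels.
salary <- factor(diverse$salary_range,
                 levels = c("80-89k", "90-99k", "100-109k", "110-119k"))

overall <- as.numeric(prop.table(table(salary)))
group   <- paste(diverse$age_range, diverse$city)

# Distance from each group's salary distribution to the overall one.
distances <- tapply(salary, group, function(x) {
  ordered_emd(as.numeric(prop.table(table(x))), overall)
})

t_threshold <- 0.1
satisfies_t_closeness <- all(distances <= t_threshold)
```

Under these assumptions the 40–49 Calgary group sits about 0.23 from the overall distribution, so the table would not satisfy \(t = 0.1\) even though it is \(2\)-anonymous and \(2\)-diverse.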
There are still issues…
Even though the data is de-identified, some sensitive patterns can still leak through.
In the example we discussed, both individuals are grouped into the same age range and city.
While they are in different salary ranges and exact values are hidden, the range is still quite narrow.
Due to the similarity of the salary ranges, one can still infer that both individuals earn between $90,000 and $119,999.
| Age Range | City | Salary Range |
|---|---|---|
| 40–49 | Calgary | 110,000–119,999 |
| 40–49 | Calgary | 90,000–99,999 |
Differential privacy
- So, we may need more sophisticated tools to privatize our data…
- Differential privacy is a mathematical approach to protecting privacy
- It ensures algorithm results are nearly the same whether one person’s data is included or not
- Differential privacy makes it hard to tell if any individual’s data is in the dataset, which protects each individual’s information (even when their data is unusual or unique)
- Differential privacy is a complex topic and goes beyond the scope of this course
- For a clear and accessible explanation, check out this short video:
iClicker Question 1
Given the data, which field(s) could you generalize to help achieve k = 3 anonymity?
| Age | ZIP Code | Disease |
|---|---|---|
| 29 | 13053 | Flu |
| 27 | 13068 | Flu |
| 28 | 13068 | Cold |
| 45 | 14853 | Diabetes |
| 46 | 14853 | Diabetes |
| 47 | 14853 | Cancer |
- A. Generalize Age into age ranges (e.g., 20–29, 40–49)
- B. Suppress Disease entirely
- C. Generalize ZIP Code to first 3 digits (e.g., 130, 148)
- D. Generalize Age into age ranges (e.g., 20–29, 40–49) and ZIP code to first 3 digits (e.g., 130, 148)
- E. It’s already \(k=3\) anonymous
iClicker Question 2
Which of the following datasets violates \(k = 2\) anonymity?
Option A
| Age | Sex | ZIP Code |
|---|---|---|
| 34 | M | 02138 |
| 34 | M | 02138 |
| 34 | F | 02139 |
Option B
| Age | Sex | ZIP Code |
|---|---|---|
| 22 | F | 10011 |
| 22 | F | 10011 |
| 22 | F | 10011 |
Option C
| Age | Sex | ZIP Code |
|---|---|---|
| 30–39 | * | 021** |
| 30–39 | * | 021** |
| 30–39 | * | 021** |
- A. Only A
- B. Only B
- C. Only C
- D. A and B
Question 3
Consider this 3-anonymous dataset. Is it also 2-diverse with respect to “Condition”?
| Age Range | ZIP Code | Condition |
|---|---|---|
| 20–29 | 130** | Flu |
| 20–29 | 130** | Flu |
| 20–29 | 130** | Flu |
| 30–39 | 148** | Cold |
| 30–39 | 148** | Cold |
| 30–39 | 148** | Cancer |
- A. Yes, both groups have 2 or more different values
- B. No, one group violates l-diversity
- C. Yes, because the dataset is already k-anonymous
- D. No, both groups have only one distinct value
Key Takeaways
- Removing direct identifiers alone does not guarantee privacy
- Quasi-identifiers can lead to re-identification if not protected
- \(k\)-anonymity makes each record indistinguishable from at least \(k - 1\) others
- \(l\)-diversity improves protection by promoting diversity in sensitive attributes
- Differential privacy offers mathematical privacy guarantees
- Choosing privacy parameters involves balancing risk and data utility